Insurance redlining -- a complete example continued

12/05/2019

Introduction

  • Recall, we are investigating a complicated question:

    • were insurance companies applying discriminatory business practices to majority non-white neighborhoods in Chicago,
    • or were their practices justifiable as standard business practice, e.g.,
    • limiting access or raising prices based on legitimate crime statistics, etc.?
  • The term “insurance redlining” refers literally to drawing a red line on a map that excludes certain neighborhoods from services.

  • In the late 1970s, the US Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods.

  • Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978 was recorded.

  • The FAIR plan was offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market.

  • Information on other variables that might affect insurance writing such as fire and theft rates was also collected at the zip code level.

Chicago (chredlin) dataset

Map of Chicago neighborhoods Courtesy of Peter Fitzgerald CC BY-SA 3.0
  • The data that we have available for our analysis includes the variables:
    1. race – ethnic composition in percentage of minority;
    2. fire – fires per 100 housing units;
    3. theft – thefts per 1000 population;
    4. age – percentage of housing units built before 1939;
    5. involact – new FAIR plan policies and renewals per 100 housing units;
    6. income – median family income in thousands of dollars;
    7. side – north or south side of Chicago.
  • For reference, the South Side of Chicago has long had a reputation and cultural identity built around working-class communities of African, Hispanic, Chinese, Irish, and Polish Americans, amongst others.
  • This is due, in part, to historic segregation measures which by the late 1970s were considered illegal.

The ecological fallacy

  • We note here that we do not actually know the ethnicity of the individuals who are denied insurance in this data set.

  • Rather, we only know the ethnic makeup of the zip code where an individual was denied.

  • This is an important difficulty that needs to be considered carefully before starting our analysis.

  • When data are collected at the group level, we may observe a correlation between two variables.

  • The ecological fallacy is concluding that the same correlation holds at the individual level.

  • Particularly, we only have aggregate data on the zip code level describing the ethnic composition of different neighborhoods, and the number of FAIR plan policies per neighborhood.

  • Even if there is a strong correlation between a neighborhood's proportion of minority residents and its number of FAIR plans,

    • we cannot make the conclusion that minority individuals in these neighborhoods were disproportionately rejected on the normal insurance market.

Review of exploratory analysis

  • In our exploratory analysis, we noted that many of the variables are relatively skewed, with the entire first quartile of the response equal to zero.

  • We also saw some issues with multicollinearity in the variables, where the main predictor of interest race has some weak and some strong (anti)-correlations with other variables.

  • The main point of interest is to distinguish these effects, as much as possible, and to determine if there was any systematic bias against neighborhoods with high minority concentrations.
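As a small sketch of how the (anti-)correlations noted above could be tabulated directly, the helper below is our own (the name `cor_with` is not from the faraway package or the course code), using only base R:

```r
# Hypothetical helper: correlation of every other column of a data frame
# with a chosen target column, ignoring incomplete pairs.
cor_with <- function(dat, target) {
  others <- setdiff(names(dat), target)
  sapply(dat[others], cor, y = dat[[target]], use = "complete.obs")
}
# e.g. cor_with(chredlin[, 1:6], "race") would show the strong positive
# correlation with fire and the strong anti-correlation with income.
```

Applied to the numeric columns of chredlin, this gives a one-line summary of which covariates are entangled with the predictor of interest.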

Exploratory analysis – continued

library(faraway)
library(ggplot2)
ggplot(chredlin, aes(race, fire)) + geom_point() + stat_smooth(method = "lm") +
  theme(axis.text = element_text(size = 20), axis.title = element_text(size = 20, face = "bold"))

[Plot: fire versus race with fitted regression line]

cor(chredlin$fire,chredlin$race)
[1] 0.5927956
  • When plotting fire versus race, we see a somewhat nonlinear but positive trend: neighborhoods with higher minority concentration tend to have higher rates of fires.

Exploratory analysis – continued

ggplot(chredlin, aes(race, theft)) + geom_point() + stat_smooth(method = "lm") +
  theme(axis.text = element_text(size = 20), axis.title = element_text(size = 20, face = "bold"))

[Plot: theft versus race with fitted regression line]

cor(chredlin$theft,chredlin$race)
[1] 0.2550647
  • When plotting theft versus race, we see a trend of slightly higher theft rates in neighborhoods of higher minority concentration.

  • We note, also, this plot has an extreme outlier that may be important for the analysis – this might be skewing the trend positively in a way that isn't consistent with the overall behaviour of the city.

  • Generally, however, the correlation is much weaker than with fire.

Exploratory analysis – continued

  • Plotting median income versus the percent of minority in each neighborhood, we see that these variables are strongly anti-correlated:
ggplot(chredlin, aes(race, income)) + geom_point() + stat_smooth(method = "lm") +
  theme(axis.text = element_text(size = 20), axis.title = element_text(size = 20, face = "bold"))

[Plot: income versus race with fitted regression line]

cor(chredlin$income,chredlin$race)
[1] -0.7037328
  • This is due, in part, to the historic segregation and inequality in Chicago, and the influence of these two variables can be very difficult to separate.

Summary of diagnostics

  • We will recall some of the diagnostics here.

  • We note that there is a structural issue arising due to the high number of zero FAIR plans (the response variable), but that there isn't a realistic choice of scale transformation to fix this.

  • Normality and variance assumptions of the error are otherwise OK.

  • We will probably need to evaluate the effect of points of high leverage in the analysis.

  • We thus wish to study the sensitivity of the main parameter of interest, race, with respect to the leverage points, while controlling for covariates in the analysis.

  • We will begin by looking at partial residual plots, which are used to detect nonlinearity while factoring out the other covariates.

    • Specifically, we will evaluate the structural uncertainty of the model, in terms of whether the hypothesis \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \] is a valid form for the signal.
  • We will also look at the confidence intervals for the parameters.

Evaluating the uncertainty

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
lmod <- lm(involact ~ race + fire + theft + age + log(income), data=chredlin)
termplot(lmod, partial.resid=TRUE, terms=1,  cex=3, cex.lab=3, cex.axis=1.5)

[Plot: partial residuals versus race]

coefficients(lmod)[2]
       race 
0.009502223 
confint(lmod, parm = "race")
           2.5 %     97.5 %
race 0.004474458 0.01452999

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=2,  cex=3, cex.lab=3, cex.axis=1.5)

[Plot: partial residuals versus fire]

coefficients(lmod)[3]
      fire 
0.03985604 
confint(lmod, parm = "fire")
          2.5 %     97.5 %
fire 0.02215246 0.05755963

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=3,  cex=3, cex.lab=3, cex.axis=1.5)

[Plot: partial residuals versus theft]

coefficients(lmod)[4]
      theft 
-0.01029451 
confint(lmod, parm = "theft")
            2.5 %       97.5 %
theft -0.01598536 -0.004603655

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=4,  cex=3, cex.lab=3, cex.axis=1.5)

[Plot: partial residuals versus age]

coefficients(lmod)[5]
      age 
0.0083356 
confint(lmod, parm = "age")
          2.5 %     97.5 %
age 0.002793968 0.01387723

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=5,  cex=3, cex.lab=3, cex.axis=1.5)

[Plot: partial residuals versus log(income)]

coefficients(lmod)[6]
log(income) 
  0.3457615 
confint(lmod, parm = "log(income)")
                 2.5 %   97.5 %
log(income) -0.4623041 1.153827

Evaluating the uncertainty – continued

  • For each of:
    1. race
    2. fire
    3. age
  • there isn’t a great deal of uncertainty in the effect on the response, and the structure appears to be fine in the partial residuals.
  • Log-income, as noted earlier, is not significant, and the uncertainty of this parameter is reflected in its wide confidence interval.

  • We make note, however, that while the confidence intervals for theft are small, the analysis may be strongly affected by the observation of high leverage.

    • Indeed, the coefficient for theft is negative, and this value may be strongly affected by the extreme zip code.
  • This is something we will consider later in the analysis.

Evaluating the uncertainty – continued

  • With the generally good diagnostics, in terms of Gaussianity and constant variance of the errors (with some qualifications), we might draw some conclusions.

    • That is, we can put some faith in these p-values and confidence intervals, but this should always be qualified.
  • Loosely speaking, it appears that there is a small, but statistically significant, effect in which neighborhoods with higher concentrations of ethnic minorities have a higher number of FAIR plans, when all other variables are held equal.

  • This effect is not nearly as strong as e.g., the rate of fires with all other variables held equal.

    • However, this indicates a small but non-random effect, in which neighborhoods have a higher number of FAIR plans based on a variable that shouldn't matter in an insurance application decision.
  • Our analysis is not nearly complete, and we cannot say whether any individual application for insurance was rejected based upon the applicant's ethnic background;

    • nonetheless, this corroborates the claims of the Chicago neighborhood organizations who filed their civil rights complaints – this warrants further investigation of the overall effect.

Evaluating the sensitivity of parameters

  • In terms of analyzing the sensitivity of the race parameter with respect to the inclusion or exclusion of other covariates, we can automate this to determine if the effect on the response changes dramatically.

    • This is a larger kind of uncertainty in the model selection itself;
    • if e.g., the parameter changed sign with a different choice of variables, we would have very contradictory explanations of the effect of the concentration of minorities on the number of FAIR plans.
  • This is similar to the case when we used matching in the voting districts in New Hampshire, where we needed to see if an additional covariate was acting as a confounding variable on our analysis.

  • By analyzing the inter-model stability of this parameter, we get a better sense of whether this explanation is actually robust.

Evaluating the sensitivity of parameters – continued

  • Suppressing the code that automates this (given in Faraway, 2nd edition), we will compare the parameter value and associated p-value for race in each combination of the variables:
                                            beta   pvalue
race                                        0.0139 0.0000
race + fire                                 0.0089 0.0002
race + theft                                0.0141 0.0000
race + age                                  0.0123 0.0000
race + log ( income )                       0.0082 0.0087
race + fire + theft                         0.0082 0.0002
race + fire + age                           0.0089 0.0001
race + fire + log ( income )                0.0070 0.0160
race + theft + age                          0.0128 0.0000
race + theft + log ( income )               0.0084 0.0083
race + age + log ( income )                 0.0099 0.0017
race + fire + theft + age                   0.0081 0.0001
race + fire + theft + log ( income )        0.0073 0.0078
race + fire + age + log ( income )          0.0085 0.0041
race + theft + age + log ( income )         0.0106 0.0010
race + fire + theft + age + log ( income )  0.0095 0.0004
  • The above output shows the parameter for race and the associated p-values for all 16 models.

  • There is some variance in the magnitude of the effect, depending on the other variables we control for, but in no case does the p-value rise above \( 5\% \) or does the parameter change sign.

  • While this suggests some uncertainty over the magnitude of the effect, we can be confident that the significance of the effect is not sensitive to the choice of adjusters.
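The suppressed automation can be sketched as follows. This is our own reconstruction (the function name `race_by_subset` is ours), not Faraway's exact code: it fits involact on race plus every subset of the remaining covariates and records the race coefficient and its p-value.

```r
# Sketch: fit involact ~ race + (each subset of the other covariates),
# recording the race coefficient and its p-value for all 16 models.
race_by_subset <- function(dat, covars = c("fire", "theft", "age", "log(income)")) {
  # all subsets of the covariates, from the empty set up to all four
  subsets <- unlist(lapply(0:length(covars), function(k)
    combn(covars, k, simplify = FALSE)), recursive = FALSE)
  out <- t(sapply(subsets, function(s) {
    m <- lm(reformulate(c("race", s), response = "involact"), data = dat)
    summary(m)$coefficients["race", c("Estimate", "Pr(>|t|)")]
  }))
  rownames(out) <- sapply(subsets, function(s) paste(c("race", s), collapse = " + "))
  round(out, 4)
}
# race_by_subset(chredlin) should reproduce the 16-row table above.
```

Running this on chredlin should recover the table of betas and p-values shown above.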

Evaluating the sensitivity of parameters – continued

  • Had we found models in which the composition of the neighborhood in terms of minorities was not statistically significant, controlling for other factors would have been more difficult.

  • We would then need to consider more deeply which covariates should be adjusted for and which not.

  • This level of model analysis and selection is at the heart of many real world problems, and we are fortunate as modellers in this case that it is unnecessary.

  • In general, however, this is the reality of real problems you will encounter after this class.

  • Now, we want to analyze the effect of leverage and influence on the model and our interpretation.

  • We recall that there are several influential points, and we wish to determine if their inclusion or exclusion would radically change the interpretation for the parameter for race.

Evaluating the sensitivity of parameters – continued

  • We plot below, using the “dfbeta” function we used in another analysis, the change of the parameter for race when excluding any particular observation:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in race coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)

[Plot: change in the race coefficient when deleting each observation]

  • The difference is negligible given the size of the parameter and scale of the response, so we are not especially concerned about the sensitivity of the parameter to observations.

Evaluating the sensitivity of parameters – continued

  • We note that we can also produce a similar plot from the “lm.influence” function, turning the output into a dataframe
diags <- data.frame(lm.influence(lmod)$coef)
  • Using this, we will plot the terms in ggplot2:
ggplot(diags, aes(row.names(diags), race)) + geom_linerange(aes(ymax = 0, ymin = race)) +
  theme(axis.text.x = element_text(angle = 90), axis.text = element_text(size = 20),
        axis.title = element_text(size = 20, face = "bold")) +
  xlab("ZIP") + geom_hline(yintercept = 0)

[Plot: change in the race coefficient by ZIP code]

Evaluating the sensitivity of parameters – continued

  • Likewise, using the diags dataframe, we can produce a two-dimensional plot of the differences in each of the fire and theft parameters when excluding one of the observations:
ggplot(diags,aes(x=fire,y=theft))+geom_text(label=row.names(diags)) +
  theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

[Plot: changes in the fire and theft coefficients, labeled by ZIP code]

Evaluating the sensitivity of parameters – continued

  • As in the earlier diagnostics, we see that the zip codes 60607 and 60610 stick out for high leverage and influence on the parameter values:
chredlin[c("60607","60610"),]
      race fire theft  age involact income side
60607 50.2 39.7   147 83.0      0.9  7.459    n
60610 54.0 34.1    68 52.6      0.3  8.231    n
  • These are both unusually high fire and high theft zip codes – we will remove these two observations simultaneously and see the effect on the parameters:
lmode <- lm(involact ~ race + fire + theft + age + log(income), chredlin,subset=-c(6,24))
sumary(lmode)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -0.5767365  1.0800462 -0.5340   0.59638
race         0.0070527  0.0026960  2.6160   0.01259
fire         0.0496474  0.0085701  5.7931 1.004e-06
theft       -0.0064336  0.0043489 -1.4794   0.14708
age          0.0051709  0.0028947  1.7863   0.08182
log(income)  0.1157028  0.4011132  0.2885   0.77453

n = 45, p = 6, Residual SE = 0.30320, R-Squared = 0.8
  • Their exclusion actually makes theft and age no longer significant, though race remains significant.

Evaluating the sensitivity of parameters – continued

  • We have verified that our conclusions are also robust to the exclusion of one or perhaps two cases from the data.

    • If our conclusions depended on including or excluding only one or two observations we would need to be particularly sure of these measurements.
    • In some situations, we might wish to drop such influential cases but this would require strong arguments that such points were in some way exceptional.
  • In any case, it would be very important to disclose this choice in the analysis, and any suppression of the information would be extremely dishonest.

Evaluating predictive power

  • Although this is a model designed for explanation purposes, for completeness in the example, we will look at the predictive power of the model:
x <- model.matrix(lmod)
x0 <- apply(x,2,median)
x0 <- data.frame(t(x0))
# note here that R won't accept the variable that we've rescaled until we re-name it back to its original name... 
colnames(x0)[6] <- "income"
predict(lmod, newdata = x0, interval = "prediction")
          fit       lwr     upr
1 0.003348934 -1.398822 1.40552
predict(lmod, newdata = x0, interval = "confidence")
          fit       lwr      upr
1 0.003348934 -1.225333 1.232031
  • We verify, from the width of both intervals (for the mean response and for a new observation) at the median predictor values, that the model is worthless for predictions…

Qualifying our analysis: a hypothetical example

  • We have shown very robust evidence that, with all other variables held equal, a neighborhood's ethnic composition has a statistically significant effect on its number of FAIR plans.

  • This doesn't mean conclusively, however, that there was systematic discrimination.

  • If we try another hypothetical model, suppressing theft, age, and the two high-influence observations:

modalt <- lm(involact ~ race+fire+log(income),chredlin, subset=-c(6,24))
sumary(modalt)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.7532557  0.8358798  0.9012   0.37277
race         0.0042061  0.0022757  1.8483   0.07178
fire         0.0510220  0.0084504  6.0378 3.822e-07
log(income) -0.3623825  0.3191620 -1.1354   0.26279

n = 45, p = 4, Residual SE = 0.30919, R-Squared = 0.79
  • the parameter for race is no longer significant.

  • This is just a hypothetical example: our relatively exhaustive analysis has shown that this model is cherry-picked relative to the other possible models.

Qualifying our analysis – continued

  • Generally, however, this shows some of the subtlety in statistical analyses like this one;

    • particularly, independent analyses can lead to a different sequence of models, remediation and refinement, possibly to different conclusions.
  • Part of robust analysis is thus to perform several model selections and the associated diagnostics to deal with the overall uncertainty in the model selection;

    • reasonably we should go back and perform the same steps without log income, for example.
  • The strength of our conclusions will increase if we can show that several different models all have similar interpretations.

  • If there are several reasonable models with different interpretations or conclusions, there is a fundamental issue in the data, and we should be cautious and skeptical about whether there is really a signal to be found.

    • A modeler who lacks scientific integrity can similarly explore a large number of models but report only the one that favors a particular conclusion;
    • this is extremely dishonest, and if discovered, often leads to the author being professionally discredited (their work no longer taken seriously, and themselves unemployable).
    • This is likewise the case for industrial applications;
    • even if the work is not published, an analyst who consistently builds flimsy models that aren't demonstrably robust will be unemployable, because they cannot help business operations in any meaningful way.

Qualifying our analysis – continued

  • Generally, we have performed rigorous analysis in the above (pending completion of additional analysis).

  • However, there are still several ways we should be cautious about our conclusions:

    1. First, this analysis is based on aggregate data – we have assumed that the probability a minority homeowner would obtain a FAIR plan, after adjusting for the effect of the other covariates, is constant across zip codes.

      If this probability varies in a systematic way (as opposed to random noise about the signal) then our conclusions may be well off the mark.

    2. We have demonstrated statistical significance, but the practical effect is actually fairly small. The largest value of the response (percent FAIR plans) is only \( 2.2\% \) and most other values are much smaller.

      The predicted difference between \( 0\% \) minority and \( 100\% \) minority is about \( 1\% \), so while we may be confident that some people were affected by discrimination, there may not be enough of them to call this systematic discrimination.

    3. There may be some other latent, confounding variable that we haven’t accounted for that could explain this small difference if included in the model.

    4. We may find other ways to aggregate the data, for which our conclusions will change.
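The "practical effect" arithmetic in point 2 above can be made explicit. As a back-of-the-envelope sketch, we take the race coefficient from the full fit earlier in these slides (about 0.0095) and scale it by the full 0-to-100 range of the percent-minority variable:

```r
# race coefficient from the full model fit earlier (approximately 0.0095)
race_coef <- 0.0095
# predicted difference in involact between 0% and 100% minority neighborhoods
effect <- race_coef * (100 - 0)
effect  # about 0.95, i.e. roughly a 1% difference in the FAIR plan rate
```

Compared with a maximum observed response of 2.2%, this is statistically detectable but practically modest.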

Qualifying our analysis – continued

  • For example, if we fit two separate models for the North Side and South Side of Chicago, we can get very different results:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "s"),chredlin)
sumary(lmod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9946613  1.6949252 -1.1768 0.256470
race         0.0093509  0.0046055  2.0304 0.059285
fire         0.0510812  0.0170314  2.9992 0.008493
theft       -0.0102565  0.0090972 -1.1274 0.276184
age          0.0067778  0.0053061  1.2774 0.219700
log(income)  0.6810543  0.6493336  1.0489 0.309832

n = 22, p = 6, Residual SE = 0.34981, R-Squared = 0.76
  • In the South Side, the parameter for race is no longer significant…

Qualifying our analysis – continued

  • while in the North Side, it is:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "n"),chredlin)
sumary(lmod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7902939  1.7793768 -0.4441  0.66196
race         0.0133200  0.0053903  2.4711  0.02310
fire         0.0246654  0.0154219  1.5994  0.12623
theft       -0.0079383  0.0039815 -1.9938  0.06073
age          0.0090040  0.0046471  1.9375  0.06769
log(income)  0.1666853  0.6233641  0.2674  0.79204

n = 25, p = 6, Residual SE = 0.35115, R-Squared = 0.76
  • Generally, by sub-dividing the data into smaller and smaller groups, we can dilute the significance.

  • However, as noted before, aggregating data too much causes its own issues – the balance between these is contextual and subjective based on our understanding of the task at hand.

Qualifying our analysis – continued

  • In this context, we might explain the difference in terms of the demographics of the North Side and the South Side:
ggplot(chredlin, aes(side, race)) + geom_point(position = position_jitter(width = .2, height = 0)) +
  theme(axis.text = element_text(size = 20), axis.title = element_text(size = 20, face = "bold"))

[Plot: race by side of the city, with jitter]

summary(chredlin$race[chredlin$side=="n"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.80   10.00   21.95   24.50   94.40 
summary(chredlin$race[chredlin$side=="s"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   34.27   48.10   49.80   72.92   99.70 
  • With the South Side having a much higher concentration of minority residents, this variable has less explanatory power on the South Side, in the presence of other latent variables.

Qualifying our analysis – continued

Population map of race and ethnicity in Chicago 2010 Courtesy of Eric Fischer CC BY-SA 2.0
  • A visualization of the demographics of Chicago is to the left, where the North Side/ South Side divide is clearly expressed.
  • This is a visualization of the 2010 census, where each dot represents 25 residents.
  • The color coding is as follows:
    1. Red: non-Hispanic Caucasian
    2. Blue: African American
    3. Orange: Hispanic
    4. Green: Asian
    5. Yellow: other

Summary of the complete example

  • We performed initial, exploratory analysis, checking for structure in the data, and in the plots of variables versus the response.

  • We fit a model based on a mix of criterion methods and some qualitative judgement.

  • We performed diagnostics to see how our assumptions might be breaking down – aside from an unavoidable issue with the many zero observations for the FAIR plan, it was a well functioning model.

  • We evaluated the partial residuals, and confidence intervals for each of the parameters in our model.

    • This gave us some indication of the uncertainty of this parameter and the structure of the model with respect to it.
  • As this is an explanatory model, we were interested in “predicting” the parameter for “race”;

    • we wanted to understand the sensitivity of this parameter, so we analyzed the p-value and the parameter value for qualitative differences versus all possible covariates in the data.
    • This illustrated a deeper level of the model uncertainty, though one to which our conclusions seemed relatively robust.
  • We also demonstrated robustness to the exclusion of observations, and checked formally for outliers that could affect the fit.

  • Just for fun, we saw that it has awful predictive power…

  • We made a final assessment of the uncertainty that we could not account for with our data.

Summary of the complete example – continued

  • After all this analysis, everything we say needs to be hedged with “ifs” and “buts” and qualified carefully.

  • This is the reality of statistical modeling, where causal conclusions are extremely difficult to draw.

  • To quote Faraway, quoting Winston Churchill:

    “Indeed, it has been said that Democracy is the worst form of government, except all those other forms that have been tried from time to time.”

  • We might say the same about statistics with respect to how it helps us reason in the face of uncertainty.

    • It is not entirely satisfying but the alternatives are worse.

Missing data

  • Missing data causes both technical data processing and statistical challenges in regression analysis.

  • By creating a framework for different kinds of missing data, we have some mathematical tools for understanding how it will affect our analysis.

  • Certain kinds of missing data are harder than others to treat mathematically – at times, we won't have good tools for filling missing data values, or treating the implicit bias therein.

  • However, certain kinds of missing data can be modeled statistically, along with the uncertainty when treating these cases.

Types of missing data

  • Here is a summary of different missing data regimes we might find ourselves:

    1. Missing cases – this is typically the situation we are in during any statistical study.
      • Particularly, we must infer how the signal will generalize to the population at large, using observations of only a small sub-sample.
      • In special circumstances, if the cases that we don’t observe are not observed due to the signal we are studying, our sample will be biased.
    2. Incomplete values – often times when examining medical outcomes, we will not know the final outcomes of patients before the medical study ends.
      • In this situation, data on participants may still hold useful information, but we have to deal with the fundamental incompleteness. Methods to handle this include survival analysis.
    3. Missing values – quite often, samples will have some of the observations of the response or explanatory variables missing or corrupted.
      • Depending on the type of “missingness” we will have different tools to handle this.

Types of missing values

  • The following types of missing values are distinguished as follows:

    1. Missing Completely At Random (MCAR)
      • The probability of any value being missing is the same for every sample.
      • In this case, there is no bias induced by how the values are missing, though we lose information on the signal.
    2. Missing At Random (MAR)
      • Here we suppose the probability of a value being missing depends on a systematic mechanism with the known explanatory variables.
      • E.g., in social surveys certain groups may be less likely to respond.
        • If we know what sub-group the sample belongs to, we can typically delete the incomplete observation provided we adjust the model for the group membership as a factor.
    3. Missing Not At Random (MNAR)
      • The probability that a value is missing from a sample depends on an unknown, latent variable that we don’t observe, or based on the response we wish to observe itself.
        • E.g., if an individual has something to hide that is embarrassing or illegal, they may quite often avoid disclosing information that would suggest so.
        • This case is difficult, and often mathematically intractable, to handle.
  • Amongst the above, when the data are MAR, we can adjust based on the observed variables and therefore handle missingness and bias mathematically – we will focus on this situation.

A concrete example of missing values

  • If we wish to understand methods of treating missing data, we can take a data set and delete values to compare how conclusions might be affected by our treatment.

  • As a prototypical example, we will study the Chicago insurance data once again, but with values missing at random.

library("faraway")
summary(chmiss)
      race            fire           theft             age       
 Min.   : 1.00   Min.   : 2.00   Min.   :  3.00   Min.   : 2.00  
 1st Qu.: 3.75   1st Qu.: 5.60   1st Qu.: 22.00   1st Qu.:48.30  
 Median :24.50   Median : 9.50   Median : 29.00   Median :64.40  
 Mean   :35.61   Mean   :11.42   Mean   : 32.65   Mean   :59.97  
 3rd Qu.:57.65   3rd Qu.:15.10   3rd Qu.: 38.00   3rd Qu.:78.25  
 Max.   :99.70   Max.   :36.20   Max.   :147.00   Max.   :90.10  
 NA's   :4       NA's   :2       NA's   :4        NA's   :5      
    involact          income      
 Min.   :0.0000   Min.   : 5.583  
 1st Qu.:0.0000   1st Qu.: 8.564  
 Median :0.5000   Median :10.694  
 Mean   :0.6477   Mean   :10.736  
 3rd Qu.:0.9250   3rd Qu.:12.102  
 Max.   :2.2000   Max.   :21.480  
 NA's   :3        NA's   :2       

Summarizing missing values

  • Above, we saw how many NA values appear in each explanatory variable, but we will also want to know which samples have missing values (and how many).
rowSums(is.na(chmiss))
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 
    1     0     1     1     0     0     0     0     1     1     0     0 
60646 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 
    1     0     0     1     0     0     0     1     1     0     0     1 
60623 60608 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 
    0     1     1     0     1     0     0     0     1     0     1     0 
60652 60620 60619 60649 60617 60655 60643 60628 60627 60633 60645 
    0     1     0     1     1     0     0     1     0     0     1 
  • Here there is at most one value missing from each row; likewise, the missing values are spread fairly evenly across the data.

    • If the missing values were concentrated in a few cases, we could likely drop those samples without much loss of information.
    • However, in this example, dropping the samples with NA's would delete 20 out of 47 of the cases.
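The count of cases lost to deletion can be checked directly; a minimal sketch using base R's complete.cases:

```r
# Count how many rows would survive listwise deletion
sum(complete.cases(chmiss))    # rows with no NA's (27 here)
sum(!complete.cases(chmiss))   # rows that would be dropped (20 here)
```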

Summarizing missing values

  • We can also view how clumped or dispersed missing values are graphically as a diagnostic:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
image(is.na(chmiss),axes=FALSE,col=gray(1:0))
axis(2, at=0:5/5, labels=colnames(chmiss), cex=3, cex.lab=3, cex.axis=1.5)
axis(1, at=0:46/46, labels=row.names(chmiss),las=2, cex=3, cex.lab=3, cex.axis=1.5)

Plot: image of the missing-value pattern, with the variables on the vertical axis and the zip codes on the horizontal axis.

Deleting missing values

  • Suppose we favor a deletion approach to all samples with missing values.

  • We will compare the full model with all values versus the situation in which we delete missing values:

modfull <- lm(involact ~ .  - side, chredlin)
sumary(modfull)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -0.6089790  0.4952601 -1.2296 0.2258512
race         0.0091325  0.0023158  3.9435 0.0003067
fire         0.0388166  0.0084355  4.6015     4e-05
theft       -0.0102976  0.0028529 -3.6096 0.0008269
age          0.0082707  0.0027815  2.9734 0.0049143
income       0.0245001  0.0316965  0.7730 0.4439816

n = 47, p = 6, Residual SE = 0.33513, R-Squared = 0.75

Deleting missing values – continued

  • On the other hand, when we fit with the samples containing missing values deleted (the default NA handling for “lm”):
modmiss <- lm(involact ~ ., chmiss)
sumary(modmiss)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -1.1164827  0.6057615 -1.8431 0.0794750
race         0.0104867  0.0031283  3.3522 0.0030180
fire         0.0438757  0.0103190  4.2519 0.0003557
theft       -0.0172198  0.0059005 -2.9184 0.0082154
age          0.0093766  0.0034940  2.6837 0.0139041
income       0.0687006  0.0421558  1.6297 0.1180775

n = 27, p = 6, Residual SE = 0.33822, R-Squared = 0.79
  • The standard errors increase because the estimates are less precise, due to the loss of information from the deleted cases.

Single imputation

  • We may thus consider “filling in” the missing values for various samples; this is known as data imputation.

  • One approach is to fill in values by something “representative” of the known population,

    • e.g., fill in by each variable mean.
(cmeans <- colMeans(chmiss,na.rm=TRUE))
      race       fire      theft        age   involact     income 
35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667 
  • However, we don't want to fill in values for the response, as this is what we are trying to model:
mchm <- chmiss
for(i in c(1:4,6)) mchm[is.na(chmiss[,i]),i] <- cmeans[i]
imod <- lm(involact ~ ., mchm)

Single imputation – continued

  • Then, we look at the model summary, based on the imputation of the means:
sumary(imod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0708021  0.5094531  0.1390 0.890203
race         0.0071173  0.0027057  2.6305 0.012245
fire         0.0287418  0.0093855  3.0624 0.004021
theft       -0.0030590  0.0027457 -1.1141 0.272242
age          0.0060795  0.0032079  1.8952 0.065695
income      -0.0270917  0.0316782 -0.8552 0.397791

n = 44, p = 6, Residual SE = 0.38412, R-Squared = 0.68
  • In this case, there isn't just loss of fit, but also qualitative differences in the parameters.

  • In particular, theft and age have lost significance in this model versus the models fitted previously.

  • Also, the parameters themselves are smaller in magnitude, describing less “effect” in the signal overall.

Single imputation – continued

  • In this case, we see significant bias induced by the imputation (toward the mean values);

    • this is why we usually only consider mean imputation when the number of missing values is small relative to the full population.
  • When there is a small number of missing values for a categorical variable, we can typically model “missing” as a category of its own.

  • We can consider a more sophisticated approach for handling missing values using regression.

  • Particularly, if the variables are strongly (anti)-correlated with each other, there is information in the explanatory variables telling how they vary together (or oppositely).
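The missing-as-category idea above can be sketched with base R's addNA, using a small hypothetical factor (not part of the chmiss data):

```r
# Hypothetical categorical variable with a missing value
x <- factor(c("north", "south", NA, "north"))
x <- addNA(x)    # NA becomes a level of its own
levels(x)        # "north" "south" NA
table(x)         # now counts the NA's as a regular category
```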

Single imputation – continued

  • Recall, for percent variables, it is sometimes useful to make a change of scale to the real line when such a variable is the response.

  • The logit transformation is given as, \( y \mapsto \log\left(\frac{y}{1 - y}\right) \)

  • The “logit” and “ilogit” (inverse) map are available in the Faraway package.
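To make the transformation concrete, the two maps are equivalent to the following simple definitions (shown only for illustration; in practice we use the package versions):

```r
# Equivalent definitions of the logit and inverse-logit maps
logit  <- function(p) log(p / (1 - p))
ilogit <- function(x) exp(x) / (1 + exp(x))

logit(0.5)   # 0
ilogit(0)    # 0.5
```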

  • We will model the percent ethnic minority in a zip code as regressed on the other variables with the missing cases removed:

lmodr <- lm(logit(race/100) ~ fire+theft+age+income,chmiss)
ilogit(predict(lmodr,chmiss[is.na(chmiss$race),]))*100
     60646      60651      60616      60617 
 0.4190909 14.7320193 84.2653995 21.3121261 
chredlin$race[is.na(chmiss$race)]
[1]  1.0 13.4 62.3 36.4
  • Two of the predictions are reasonable, but two are significantly off.

  • This can be performed for each of the explanatory variables, and while preferable to mean imputation, still leads to bias.

    • In a way, we can start over-fitting to our known values and lose the true variance in the population.
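Extending the regression approach to each of the remaining explanatory variables might be sketched as follows (a rough illustration, not the textbook's code; it ignores the change-of-scale issue for percent variables, and relies on there being at most one NA per row in chmiss):

```r
# Sketch: regression-impute each explanatory variable in turn
rchm <- chmiss
for (v in c("fire", "theft", "age", "income")) {
  rhs   <- setdiff(names(chmiss), c(v, "involact"))  # other predictors
  lmodv <- lm(reformulate(rhs, response = v), chmiss)
  miss  <- is.na(chmiss[[v]])
  rchm[[v]][miss] <- predict(lmodv, chmiss[miss, ])  # fill in predictions
}
```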

Multiple imputation

  • If we understand that the loss of the variation of the population is the issue with the earlier approaches, we can try to enforce some variance in the imputation mathematically.

    • If we re-introduced variation on a missing value only once, this would just be a less optimal choice for the single imputation as performed previously.
    • Instead, we will treat this as a re-sampling problem, and impute multiple perturbed values for each missing term, taking the uncertainty of this value into account.
  • We will create 25 different versions of the data “re-sampled” back with uncertainty for each missing value.

  • The known values will be the same, but we will have “m” different versions of the missing values, drawn from a “Bayesian posterior” estimate…

    • The function output of Amelia will include the “m” different datasets:

Multiple imputation – continued

library(Amelia)
set.seed(123)
chimp <- amelia(chmiss, m=25)
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83

-- Imputation 9 --

  1  2  3  4  5  6  7  8  9

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 11 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51

-- Imputation 12 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

-- Imputation 13 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143

-- Imputation 14 --

  1  2  3  4  5  6  7

-- Imputation 15 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25

-- Imputation 16 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22

-- Imputation 17 --

  1  2  3  4  5  6  7  8  9

-- Imputation 18 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54

-- Imputation 19 --

  1  2  3  4  5  6  7  8  9 10

-- Imputation 20 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130

-- Imputation 21 --

  1  2  3  4  5  6  7  8

-- Imputation 22 --

  1  2  3  4  5  6  7  8

-- Imputation 23 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 24 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29

-- Imputation 25 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30

Multiple imputation – continued

  • All imputations are stored in the object that is output by the Amelia function:
str(chimp, 1)
List of 12
 $ imputations:List of 25
  ..- attr(*, "class")= chr [1:2] "mi" "list"
 $ m          : num 25
 $ missMatrix : logi [1:47, 1:6] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..- attr(*, "dimnames")=List of 2
 $ overvalues : NULL
 $ theta      : num [1:7, 1:7, 1:25] -1 0.01738 0.00786 -0.06694 -0.00626 ...
 $ mu         : num [1:6, 1:25] 0.01738 0.00786 -0.06694 -0.00626 -0.10637 ...
 $ covMatrices: num [1:6, 1:6, 1:25] 0.977 -0.517 0.609 0.532 0.506 ...
 $ code       : num 1
 $ message    : chr "Normal EM convergence."
 $ iterHist   :List of 25
 $ arguments  :List of 23
  ..- attr(*, "class")= chr [1:2] "ameliaArgs" "list"
 $ orig.vars  : chr [1:6] "race" "fire" "theft" "age" ...
 - attr(*, "class")= chr "amelia"

Multiple imputation – continued

  • To extract the imputations, we can select any particular list value from the object, extracting the field “imputations”:
str(chimp$imputations, 1)
List of 25
 $ imp1 :'data.frame':  47 obs. of  6 variables:
 $ imp2 :'data.frame':  47 obs. of  6 variables:
 $ imp3 :'data.frame':  47 obs. of  6 variables:
 $ imp4 :'data.frame':  47 obs. of  6 variables:
 $ imp5 :'data.frame':  47 obs. of  6 variables:
 $ imp6 :'data.frame':  47 obs. of  6 variables:
 $ imp7 :'data.frame':  47 obs. of  6 variables:
 $ imp8 :'data.frame':  47 obs. of  6 variables:
 $ imp9 :'data.frame':  47 obs. of  6 variables:
 $ imp10:'data.frame':  47 obs. of  6 variables:
 $ imp11:'data.frame':  47 obs. of  6 variables:
 $ imp12:'data.frame':  47 obs. of  6 variables:
 $ imp13:'data.frame':  47 obs. of  6 variables:
 $ imp14:'data.frame':  47 obs. of  6 variables:
 $ imp15:'data.frame':  47 obs. of  6 variables:
 $ imp16:'data.frame':  47 obs. of  6 variables:
 $ imp17:'data.frame':  47 obs. of  6 variables:
 $ imp18:'data.frame':  47 obs. of  6 variables:
 $ imp19:'data.frame':  47 obs. of  6 variables:
 $ imp20:'data.frame':  47 obs. of  6 variables:
 $ imp21:'data.frame':  47 obs. of  6 variables:
 $ imp22:'data.frame':  47 obs. of  6 variables:
 $ imp23:'data.frame':  47 obs. of  6 variables:
 $ imp24:'data.frame':  47 obs. of  6 variables:
 $ imp25:'data.frame':  47 obs. of  6 variables:
 - attr(*, "class")= chr [1:2] "mi" "list"

Multiple imputation – continued

  • We will fit a model over each of the versions, and try to find a best model over all perturbed datasets by an averaging.

    • Specifically, we will take the average of the parameters as: \[ \begin{align} \hat{\boldsymbol{\beta}}_j \triangleq \frac{1}{m} \sum_{i=1}^m \hat{\boldsymbol{\beta}}_{ij}, \end{align} \] where each \( i \) represents one of the imputations and \( j \) indexes the parameter.
  • In this case, we can estimate the overall standard error in terms of the variance of the parameters \( \beta \) derived over the different versions of the imputed data.

    • Specifically, the combined standard errors are given as, \[ \begin{align} s_j^2 \triangleq \frac{1}{m} \sum_{i=1}^m s_{ij}^2 + \left( 1 + \frac{1}{m}\right) \mathrm{var}\left(\hat{\boldsymbol{\beta}}_j\right) \end{align} \] where the variance above is the sample variance of the parameter estimates across the imputed datasets.
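These combining rules can be sketched directly, given an m-by-p matrix of coefficients and one of standard errors from the m fits (a hand-rolled version of what Amelia's mi.meld computes):

```r
# Pool estimates and standard errors across m imputations (Rubin's rules)
rubin_pool <- function(betas, ses) {
  m       <- nrow(betas)
  qbar    <- colMeans(betas)        # averaged parameter estimates
  within  <- colMeans(ses^2)        # mean within-imputation variance
  between <- apply(betas, 2, var)   # between-imputation variance
  list(est = qbar, se = sqrt(within + (1 + 1/m) * between))
}
```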

Multiple imputation – continued

  • The “mi.meld” function will make these computations, though we first automate the re-fitting of the models with the for loop below:
betas <- NULL
ses <- NULL
for(i in 1:chimp$m){
  lmod <- lm(involact ~ race + fire + theft + age, chimp$imputations[[i]])
  betas <- rbind(betas, coef(lmod))
  ses <- rbind(ses, coef(summary(lmod))[,2])
}
(cr <- mi.meld(q=betas,se=ses))
$q.mi
     (Intercept)        race       fire       theft         age
[1,]  -0.2378402 0.007730138 0.03546648 -0.00873242 0.007218719

$se.mi
     (Intercept)        race        fire       theft         age
[1,]    0.161498 0.002026873 0.008519401 0.004555294 0.002531306

Multiple imputation – continued

  • The t-statistics can be computed similarly,
cr$q.mi/cr$se.mi
     (Intercept)     race     fire     theft      age
[1,]   -1.472713 3.813825 4.163025 -1.916983 2.851776
  • Here the results are fairly similar, though in this case theft is no longer a significant parameter.

  • Many more techniques exist for handling missing data; this is just an introduction on how to approach the problem, with additional references in Faraway.